Release 10.1A: OpenEdge Development: Internationalizing Applications
Techniques for working with multi-byte characters
The following techniques might save you time and trouble:
Choosing the appropriate unit of measure
Several Progress 4GL elements, including the LENGTH function, OVERLAY statement, SUBSTRING function, and SUBSTRING statement, let you specify the unit of measure as the character, the byte, or the column. If you choose the wrong unit of measure, you might split or overlay a multi-byte character. Consider the following example:
The example defines a character variable and sets it to a string of seven characters, the fourth of which is double byte. The example then overlays a string of four single-byte characters on the original string, starting at position one and continuing for four positions. Unfortunately, the unit of measure is the byte (specified by RAW), so the fourth byte of the second string, which is the character z, overlays the fourth byte of the original string, which is the lead byte of the double-byte character. Figure 8–5 shows how the z in the second string overlays the lead byte of the double-byte character in the original string.
Figure 8–5: A single-byte character overlaying a lead byte
All that remains of the multi-byte character is the trail byte, as shown in Figure 8–6.
Figure 8–6: Result of a single-byte character overlaying a lead byte
To fix this error, change the unit of measure to CHARACTER:
The corrected program produces the string shown in Figure 8–7.
Figure 8–7: String produced by an OVERLAY statement whose unit of measure is the character
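The effect described in this section can be reproduced outside OpenEdge. The following is a minimal Python sketch, not 4GL code and not the example from this manual. It assumes a Shift-JIS encoded string whose fourth character is double byte; the helper names overlay_bytes and overlay_chars are hypothetical, and padding a short source string with blanks is an assumption made for the sketch.

```python
def overlay_bytes(target: str, source: str, start: int, length: int,
                  encoding: str) -> bytes:
    """Overlay `length` bytes of `source` onto `target`, counting in bytes.

    `start` is 1-based. Character boundaries are ignored, so a
    multi-byte character in `target` can be split.
    """
    raw = bytearray(target.encode(encoding))
    # Assumption for this sketch: pad with blanks if the source is short.
    patch = source.encode(encoding)[:length].ljust(length, b" ")
    raw[start - 1:start - 1 + length] = patch
    return bytes(raw)


def overlay_chars(target: str, source: str, start: int, length: int) -> str:
    """Overlay `length` characters of `source` onto `target`, counting in characters."""
    patch = source[:length].ljust(length)
    return target[:start - 1] + patch + target[start - 1 + length:]


# Seven characters; the fourth (U+3042) is double byte in Shift-JIS (0x82 0xA0).
original = "abc\u3042def"

# Byte unit: the fourth overlaid byte replaces the lead byte 0x82 and
# leaves the trail byte 0xA0 stranded in the result.
print(overlay_bytes(original, "wxyz", 1, 4, "shift_jis"))  # b'wxyz\xa0def'

# Character unit: the whole double-byte character is replaced.
print(overlay_chars(original, "wxyz", 1, 4))  # wxyzdef
```

In the byte-based result, the orphaned byte 0xA0 is what remains of the double-byte character. The character-based overlay never splits a character because it counts whole characters rather than bytes.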
Testing character strings for multi-byte characters
To determine whether a character string contains multi-byte characters, use the LENGTH function, which returns the number of characters, bytes, or columns in a string. The syntax is:
string
A character expression. The specified string can contain double-byte characters.
type
A character expression that indicates whether you want the length of a string in character units, bytes, or columns. A double-byte character registers as one character unit. The default unit of measurement is character units.
There are three valid types: CHARACTER, RAW, and COLUMN. The expression "CHARACTER" indicates that the length is measured in characters, including double-byte characters. The expression "RAW" indicates that the length is measured in bytes. The expression "COLUMN" indicates that the length is measured in columns. If you specify the type as a constant expression, OpenEdge validates the type specification at compile time. If you specify the type as a variable expression, OpenEdge validates the type specification at run time.
raw-expression
A function or variable name that returns a raw value.
To use the technique, call LENGTH twice: once with the CHARACTER option, which returns the length in characters, and once with the RAW option, which returns the length in bytes. Then, compare the two lengths. If they are equal, the string contains only single-byte characters; otherwise, the string contains at least one multi-byte character.
The following examples illustrate the technique. The first example tests a character string consisting of one double-byte character. Since the length of the string in characters (1) does not match the length in bytes (2), the example displays "Multi-byte characters in the string":
The second example tests a character string consisting of three single-byte characters. Since the length of the string in characters (3) matches the length in bytes (3), this example displays "No multi-byte characters in the string":
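The comparison of the two lengths can also be sketched in Python. This is an illustration of the technique only, not 4GL code: it assumes the string is encoded in Shift-JIS, uses len() on the text as a stand-in for the CHARACTER length and len() on the encoded bytes as a stand-in for the RAW length, and the function name contains_multibyte is hypothetical.

```python
def contains_multibyte(text: str, encoding: str) -> bool:
    """Return True if `text` needs more bytes than characters in `encoding`."""
    char_length = len(text)                   # length in characters
    byte_length = len(text.encode(encoding))  # length in bytes
    return char_length != byte_length


def describe(text: str, encoding: str) -> str:
    """Build the message for `text`, mirroring the two cases described above."""
    if contains_multibyte(text, encoding):
        return "Multi-byte characters in the string"
    return "No multi-byte characters in the string"


# One double-byte character: 1 character, 2 bytes in Shift-JIS.
print(describe("\u3042", "shift_jis"))  # Multi-byte characters in the string

# Three single-byte characters: 3 characters, 3 bytes.
print(describe("abc", "shift_jis"))     # No multi-byte characters in the string
```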
Testing for a lead-byte value
The next technique involves testing a byte for a lead-byte value. Lead bytes (and trail bytes) often have special values to distinguish them. Table 8–5 lists the lead-byte and trail-byte values for the multi-byte code pages OpenEdge supports.
Table 8–5: Lead byte and trail byte values

| Code page | Language or standard | Lead-byte values | Trail-byte values |
|---|---|---|---|
| BIG-5 | Traditional Chinese | 161 through 254 | 64 through 126; 161 through 254 |
| CP949 | Korean | 129 through 254 | 65 through 90; 97 through 122; 129 through 254 |
| CP950 | Traditional Chinese | 129 through 254 | 64 through 126; 128 through 254 |
| CP1361 | Korean | 132 through 211; 216 through 222; 224 through 249 | 65 through 127; 129 through 254 |
| EUCJIS | Japanese | 142; 164 through 254 | 161 through 254 |
| GB2312 | Simplified Chinese | 161 through 254 | 161 through 254 |
| GB18030 (see note 1) | Extended Chinese | – | – |
| KSC5601 | Korean | 161 through 254 | 161 through 254 |
| SHIFT-JIS | Japanese | 129 through 159; 224 through 252 | 64 through 126; 128 through 252 |
| UTF-8 | Unicode | 193 through 239 | 128 through 191 |

Note 1: The GB18030 code page is a multi-byte code page, consisting of one-, two-, and four-byte characters, that extends the GB2312 code page and includes all characters defined in Unicode. Unlike most multi-byte code pages that OpenEdge supports, you cannot use the lead byte of multi-byte characters in the GB18030 code page to determine the character's length. Progress uses the International Components for Unicode (ICU) library to convert characters between the GB18030 code page and Unicode within the OpenEdge GUI client.
You cannot always assume a byte with a lead-byte value is a lead byte, or a byte with a trail-byte value is a trail byte. This is because the possible values for trail bytes overlap those of lead bytes and single bytes. For example, the value 164 can correspond to a lead byte or a trail byte. To determine which it is, you must inspect the string.
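One way to inspect the string is to scan it from the first byte and track whether the previous byte opened a double-byte character. The following Python sketch illustrates this under stated assumptions: it uses the EUCJIS lead-byte values from Table 8–5, it assumes every lead byte is followed by exactly one trail byte, and the names EUCJIS_LEAD_RANGES and classify_bytes are hypothetical. It is not how OpenEdge itself performs the check.

```python
from typing import List, Tuple

# Lead-byte values for EUCJIS, taken from Table 8-5: 142 and 164 through 254.
EUCJIS_LEAD_RANGES: Tuple[Tuple[int, int], ...] = ((142, 142), (164, 254))


def in_ranges(value: int, ranges: Tuple[Tuple[int, int], ...]) -> bool:
    """Return True if `value` falls inside any inclusive (low, high) range."""
    return any(low <= value <= high for low, high in ranges)


def classify_bytes(data: bytes) -> List[str]:
    """Label each byte as 'single', 'lead', or 'trail' by scanning left to right.

    Assumption for this sketch: a lead byte is always followed by
    exactly one trail byte.
    """
    roles = []
    expect_trail = False
    for value in data:
        if expect_trail:
            # The previous byte was a lead byte, so this byte is its trail
            # byte, even if its value also lies in the lead-byte range.
            roles.append("trail")
            expect_trail = False
        elif in_ranges(value, EUCJIS_LEAD_RANGES):
            roles.append("lead")
            expect_trail = True
        else:
            roles.append("single")
    return roles


# The value 164 appears twice: first as a lead byte, then as a trail byte.
print(classify_bytes(bytes([65, 164, 164, 66])))
# ['single', 'lead', 'trail', 'single']
```

The second 164 has a lead-byte value, but its position directly after a lead byte makes it a trail byte. Only the scan from the start of the string can tell the two apart.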
To determine if a byte has a lead-byte value, use the IS-LEAD-BYTE function, which evaluates a character expression and returns YES if the first byte of the first character of the character string has a value within the range permitted for lead bytes. Otherwise, IS-LEAD-BYTE returns NO. IS-LEAD-BYTE has the following syntax:
string
A character expression (a constant, field name, variable name, or any combination of these) whose value is a character.
In the following example, IS-LEAD-BYTE examines a string whose first character is single byte. Because the first byte of the first character is not a lead byte, its value is not within the range permitted for lead bytes, so IS-LEAD-BYTE returns NO and the example displays Lead: no:
The following example is identical to the preceding example except that the first character of the string is double byte. Because the first byte of the first character is a lead byte, its value falls within the range permitted for lead bytes, so IS-LEAD-BYTE returns YES and the example displays Lead: yes:
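The range test that this section describes can be sketched in Python. This is only an illustration of checking the first byte against the lead-byte values in Table 8–5. It is not the OpenEdge implementation of IS-LEAD-BYTE, and the names LEAD_BYTE_RANGES and first_byte_has_lead_value are hypothetical. Only three code pages from the table are included to keep the sketch short.

```python
from typing import Dict, Tuple

# Lead-byte values copied from Table 8-5 for a few code pages.
LEAD_BYTE_RANGES: Dict[str, Tuple[Tuple[int, int], ...]] = {
    "BIG-5": ((161, 254),),
    "GB2312": ((161, 254),),
    "SHIFT-JIS": ((129, 159), (224, 252)),
}


def first_byte_has_lead_value(data: bytes, code_page: str) -> bool:
    """Return True if the first byte of `data` has a lead-byte value in `code_page`."""
    if not data:
        return False  # an empty string has no first byte to test
    first = data[0]
    return any(low <= first <= high for low, high in LEAD_BYTE_RANGES[code_page])


# First character is single byte: "a" is 97, outside every lead-byte range.
print("Lead:", first_byte_has_lead_value(b"abc", "SHIFT-JIS"))          # Lead: False

# First character is double byte: 0x82 is 130, inside 129 through 159.
print("Lead:", first_byte_has_lead_value(b"\x82\xa0bc", "SHIFT-JIS"))  # Lead: True
```

Because the sketch looks only at the first byte, it is reliable only when that byte starts a character. A byte taken from the middle of a string can have a lead-byte value and still be a trail byte.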
Copyright © 2005 Progress Software Corporation www.progress.com Voice: (781) 280-4000 Fax: (781) 280-4095